SYSTEM AND METHOD FOR CLASSIFYING
ELECTRONICALLY POSTED DOCUMENTS 

BACKGROUND OF THE INVENTION 

1. Field of the Invention

This invention relates generally to systems and methods for comparing and classifying 
documents, and in particular to systems and methods for classifying electronically posted documents 
used in conjunction with search engines.

2. Description of the Related Art 

The Internet, a global network connecting millions of computers, is increasingly becoming
the preferred way to disseminate information. An estimated 150 million people worldwide use the 
Internet to access and exchange information. 

Both commercial and non-commercial entities have recognized the growing use of the
Internet and have thus accelerated the posting of "electronic documents" to provide access to their
information. As known, "electronically posted documents" ("documents," herein) may contain any
type of information which can be electronically communicated. These documents, typically web
pages, are posted on the World Wide Web, a system of Internet-accessible web servers. Individual
companies set up one or more web sites using a web server to support web page publication and
communication. Examples of information which can be included in an electronic document
such as a web page include data, text, facsimile, audio, video, and graphics, as well as other types of
information.

In many instances, the user may not know the web site location (URL address) which 
contains the desired information. Alternatively, the user may prefer to browse similar information
obtained from a variety of different web sites. In these cases, the user may employ a search engine 
to locate one or more web pages containing information about the desired topic. 

Conventional search engines, such as Yahoo®, Alta Vista® and Excite® use several programs 
to retrieve web pages containing the requested information. Typically, a "spider" or "webcrawler" 
program is used to locate and download posted documents. Once downloaded, an "indexer" 
program reads the documents and creates an index based on the words contained in each document.
Upon entry of one or more of the indexed keywords, the search engine provides to the requester a
listing of the search results, typically in the form of HTML links, each listing corresponding to one
of the indexed documents. The user may then click on one of the displayed HTML links to access
information on a particular web page. Each provider's search engine typically uses proprietary
webcrawler and indexing programs which locate and return the most comprehensive set of 
documents in the shortest amount of time. 
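
By way of illustration only, the indexing and keyword lookup described above may be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the indexer of any commercial search engine: the function names `build_index` and `search`, the document identifiers, and the simple punctuation stripping are all hypothetical.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for raw_word in text.lower().split():
            # Strip common punctuation so "Yahoo!" is indexed as "yahoo".
            word = raw_word.strip(".,!?;:\"'()")
            if word:
                index[word].add(doc_id)
    return index


def search(index, *keywords):
    """Return the ids of documents that contain every keyword."""
    if not keywords:
        return set()
    hits = [index.get(keyword.lower(), set()) for keyword in keywords]
    return set.intersection(*hits)


# Hypothetical documents keyed by made-up identifiers.
documents = {
    "page-1": "Welcome to my homepage.",
    "page-2": "Use Yahoo! to search for something.",
}
index = build_index(documents)
```

A call such as `search(index, "yahoo")` returns the identifiers of the matching documents, which a search engine would then present as links.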

A problem associated with the aforementioned process is the listing of duplicate documents 
in the search results. Duplications inconvenience the user by directing him/her to seemingly distinct 
documents which, in fact, contain identical content.

To minimize the occurrence of duplicate listings, a textual comparison process was
developed by which the text content of two downloaded or listed documents is compared. If the text
of the two documents matches, the documents are deemed duplicative and one could then be discarded
without loss of information.

One disadvantage of the conventional textual comparison process is that it performs a pairwise
document comparison process on a non-selective basis. For example, the conventional textual
comparison process will compare documents of different mime-types which are inherently
dissimilar. Performing these unnecessary document comparisons lengthens the system's response
time. Another disadvantage of the conventional process is that it does not ensure elimination of
content-duplicate listings. Documents which contain identical content but which include different
attributes (such as metadata "href" elements) are typically identified as different documents using
the conventional textual comparison process. These documents in fact are content-identical and
provide no additional information to the searcher.

In view of the disadvantages suffered by the conventional system and process, a new system 
and method for classifying posted documents is needed.

SUMMARY OF THE INVENTION 

The present invention provides new systems and methods for efficiently classifying 
electronically posted documents. The classification process employs a multi-tiered comparison 
process in which portions of corresponding metadata summaries are compared at the structural,
attribute, and text levels. This comparison process provides a fast and accurate means of determining
if two posted documents are duplicative or distinct. 

In one embodiment of the invention, a method for classifying posted documents is presented 
which includes the processes of receiving two posted documents and generating corresponding 
metadata summaries for each, wherein each of the metadata summaries includes at least one sub-tree 
structure. The structures of the two summary sub-trees within the respective metadata summaries are 
subsequently compared. If the two summary sub-trees are different, the two documents are deemed 
distinct. 

In another embodiment of the invention, a system for classifying posted documents is 
presented. The system includes a metadata parser module, a summary repository, and a summary 
consolidator. The metadata parser module receives electronically posted documents and in response 
outputs respective metadata summaries, wherein each of the respective metadata summaries includes
one or more sub-tree structures, one or more attributes, and content text. The summary repository is
coupled to receive and store the respective metadata summaries. The summary consolidator is
coupled to the summary repository and is configured to delete duplicate metadata summaries from
the summary repository. 

Other embodiments of the present invention will be gleaned from a study of the following 
drawings and detailed description of the preferred embodiments. 

BRIEF DESCRIPTION OF THE DRAWINGS 

Fig. 1A is a block diagram of an exemplary posted document classification system in
accordance with the present invention.

Fig. 1B illustrates a simplified block diagram of programming modules used in executing the
method of the present invention.

Fig. 2A illustrates an XML/RDF metadata summary generated by the metadata parser module
in accordance with one embodiment of the present invention. 

Fig. 2B illustrates a graphical mapping of the metadata summary shown in Fig. 2A in 
accordance with one embodiment of the present invention. 

Fig. 3 illustrates a method for classifying posted web pages in accordance with one 
embodiment of the present invention. 

Fig. 4 illustrates a method for selecting metadata summaries in accordance with the present
invention. 

DESCRIPTION OF THE PREFERRED EMBODIMENT 

Fig. 1A is a block diagram of an exemplary posted document classification system 100 in
accordance with the present invention. The system 100 includes a document posting device 110
coupled to a computational device 150 via a communication link 130. In one embodiment,
document posting device 110 may be an Internet-accessible web server and the communication link
130 may be a hardwired or wireless TCP/IP Internet connection. In an alternative embodiment, the
posted document device 110 may be incorporated within the computational device 150 itself.

The computational device 150 includes a network interface connection 151, a CPU 152, an
input/output device 153, such as a keyboard and monitor, and a main memory 154 for storing data
and programming instructions. Other computer components such as a disk drive 155, configured to
accept a magnetic floppy disk 157, and a direct access storage device (DASD) 156 for storing data
and programming may also be included. Data and/or program instructions may be stored on the
computer-readable medium 157, in which case the reader 155 reads and communicates the data
and/or programming instructions to the main memory 154.

Fig. 1B illustrates a simplified block diagram of the main memory 154 in which
programming modules reside for executing the method of the present invention. Included within
main memory 154 are a web crawler module 160, a metadata parser module 165, a summary
repository 170, a search engine module 175, and a summary consolidator module 180.

The web crawler module 160 searches and retrieves, via the network interface 151 and the 
communication link 130, electronically posted documents from posted document device 110. The 
retrieved documents may be stored in the main memory 154 or in the DASD 156. The metadata
parser module 165 receives the downloaded documents and generates a metadata summary which is
organized into one or more sub-tree structures in which attributes and/or text content is contained.
An exemplary embodiment of a metadata summary is shown in Fig. 2A, further described below.
The generated metadata summary is stored in the summary repository 170, which is preferably a
database configured to store the generated metadata summaries. The summary repository 170 may
reside partially or entirely within the main memory 154 or the DASD 156.
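
By way of non-limiting illustration, a metadata parser of the kind described above may be sketched in Python using the standard library `html.parser` module. The class name `SummaryParser`, the function `summarize`, and the dictionary layout of the resulting summary are assumptions made for this example; the embodiment itself produces an RDF summary rather than a Python dictionary.

```python
from html.parser import HTMLParser


class SummaryParser(HTMLParser):
    """Collects a title, out-links, and visible text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []        # one {"ref", "annotation"} dict per out-link
        self.text = []         # visible text outside <title> and <a>
        self._in_title = False
        self._open_link = None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href") or ""
            self._open_link = {"ref": href, "annotation": ""}

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._open_link is not None:
            self.links.append(self._open_link)
            self._open_link = None

    def handle_data(self, data):
        fragment = data.strip()
        if not fragment:
            return
        if self._in_title:
            self.title += fragment
        elif self._open_link is not None:
            self._open_link["annotation"] += fragment
        else:
            self.text.append(fragment)


def summarize(html, mime_type="text/html"):
    """Return a summary with top-level attributes and two sub-trees."""
    parser = SummaryParser()
    parser.feed(html)
    parser.close()
    return {
        "attributes": {"mime-type": mime_type, "html-title": parser.title},
        "sub_trees": {
            "ref-annotations": parser.links,
            "presentation-text": parser.text,
        },
    }


page = (
    "<html><head><title>Jane's Homepage</title></head><body>"
    "<p>Welcome to my homepage.</p>"
    '<a href="http://www.yahoo.com/">Yahoo!</a></body></html>'
)
summary = summarize(page)
```

The summary separates attributes from sub-tree content, so later comparison steps can examine structure, attribute values, and text independently.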

The search engine module 175 is in communication with the summary repository and preferably
includes a document indexer/user interface operable to provide information contained within one or
more of the stored metadata summaries in response to receiving the user's entry of specific
keywords. In a specific embodiment, the search engine 175 may include any of the aforementioned
commercially available search engines. In alternative embodiments, the search engine may be a
specially designed search engine for indexing metadata summaries and retrieving information such
as HTML links contained therein in response to the user's entry of specific keywords.

The summary consolidator module 180 is in communication with the summary repository 
and further includes sub-tree comparator 182, attribute comparator 184, and value comparator 186. 
As will be further described below, the summary consolidator module 180 selects metadata
summaries from the summary repository 170, and compares them on a structural, attribute, and
textual level to determine if the posted documents to which the compared summaries correspond are
duplicates. If the metadata summaries are determined to be duplicates, the duplicate metadata
summary is removed so that the summary repository 170 only stores distinct metadata summaries
corresponding to distinct documents.

Fig. 2A illustrates one embodiment of a metadata summary 200 generated by the metadata
parser module 165. The metadata summary 200 is illustrated in resource description framework
(RDF), although other formats or languages, such as attribute-value pairs, as well as others, may be
used in alternative embodiments.

The metadata summary 200 includes three portions which summarize the data contained in
its corresponding web page: a data gatherer portion 210, a metadata portion 220, and a datasource
portion 230. The data gatherer portion 210 includes information about the data gatherer, such as the
assigned title of the data gatherer and date of gathering. The metadata portion 220 includes
information about the web page source, such as the date of web page update and the document's
mime-type. The datasource portion 230 includes information about the web page data itself, or the
metadata proper. This portion may include information such as the web page's title, abstract,
presentation format, encoding, textual content, applets, scripts, embedded images and other
multimedia, information about out-links and/or in-links, as well as other metadata.

In the illustrated embodiment of Fig. 2A, the datasource portion 230 includes three attributes 
231, 232, and 233 and two sub-parts (sub-trees) 234 and 236. Attribute 231 includes an attribute
name: "html-title," and an attribute value: "Jane's Homepage." Attributes 232 and 233 list
additional attribute names and corresponding values, respectively. 

Attribute names, attribute values, and text content are stored within "bags" and "list items"
nested within sub-tree structures 234 and 236. As known in the art, the terms "Bag," "LI," and
"Description" are RDF structural constructs which the summary consolidator module 180 recognizes
in the document grouping process, further described below.

In the illustrated embodiment, the sub-tree 234 includes a "bag" (rdf:Bag) in which a first
"list item" (rdf:LI) contains a first ref attribute 234a and a first ref annotation 234b. The "bag" also
includes a second "list item" (rdf:LI) containing a second ref attribute 234c and a ref annotation
234d. As known in the art, the ref attributes 234a and 234c indicate the destination of the HTML
link (out-link) when activated. The annotation attributes 234b and 234d give the text associated with
the out-link. In the illustrated embodiment, the first ref attribute 234a has a value of
"http://www.yahoo.com/," and the second ref attribute 234c has a value of
"http://www.people.com/jane doe/my_photo.jpg." The first ref annotation 234b has a value of
"Yahoo!" and the second ref annotation 234d has a value of "picture of me." Of course, additional
ref attributes and annotations may be used having similar as well as different values in alternative
embodiments.

Sub-tree 236 defines a presentation description attribute. As known in the art, the
presentation description contains textual content of the HTML page viewable through a world wide
web browser. In the illustrated embodiment, the sub-tree 236 includes a "bag" (rdf:Bag) having a
first "list item" (rdf:LI) containing the textual content: "Welcome to my homepage." The "bag" also
includes a second "list item" containing the textual content "Use Yahoo! to search for something or
look at a picture of me."
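
By way of illustration only, the two sub-trees just described may be assembled with Python's standard `xml.etree.ElementTree` module. This is a sketch and not a reproduction of Fig. 2A: the wrapper element names `ref-annotations` and `presentation-text` and the helper function names are assumptions, and the list item element is spelled `LI` to mirror the description above (standard RDF/XML spells it `rdf:li`).

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ET.register_namespace("rdf", RDF_NS)


def rdf(tag):
    """Return the namespace-qualified name ElementTree expects."""
    return f"{{{RDF_NS}}}{tag}"


def build_ref_annotations(links):
    """Bag of list items, each holding a Description with ref/annotation."""
    root = ET.Element("ref-annotations")          # assumed wrapper name
    bag = ET.SubElement(root, rdf("Bag"))
    for ref, annotation in links:
        item = ET.SubElement(bag, rdf("LI"))
        ET.SubElement(item, rdf("Description"),
                      {"ref": ref, "annotation": annotation})
    return root


def build_presentation_text(fragments):
    """Bag of list items, each holding one fragment of page text."""
    root = ET.Element("presentation-text")        # assumed wrapper name
    bag = ET.SubElement(root, rdf("Bag"))
    for fragment in fragments:
        ET.SubElement(bag, rdf("LI")).text = fragment
    return root


# Values taken from the description of sub-trees 234 and 236; the second
# URL is reproduced exactly as printed in the source text.
ref_tree = build_ref_annotations([
    ("http://www.yahoo.com/", "Yahoo!"),
    ("http://www.people.com/jane doe/my_photo.jpg", "picture of me"),
])
text_tree = build_presentation_text([
    "Welcome to my homepage.",
    "Use Yahoo! to search for something or look at a picture of me.",
])
xml_text = ET.tostring(ref_tree, encoding="unicode")
```

Printing `xml_text` shows the nested Bag, list item, and Description elements whose arrangement forms the structure compared in the classification process.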

Fig. 2B represents a graphical mapping of the metadata summary 200. The data gatherer
portion 210 and the metadata portion 220 are included under a first RDF description node which
includes attributes as shown in the summary 200. A second RDF description node defines the
datasource portion 230 and includes metadata attributes 231, 232, and 233, as well as metadata
sub-trees 234 and 236. Metadata sub-tree 234 (ref-annotations) includes several nodes; specifically, an
RDF "bag" (rdf:Bag) which includes two RDF "list items" (rdf:LI), each including an RDF
"description" (rdf:Description). Each RDF "description" has two attributes named "ref" and
"annotation." The metadata sub-tree 236 (presentation text) also includes several nodes; particularly
an RDF "bag" node which itself includes two RDF "list item" nodes, each of which includes text
content. As will be further explained below, by comparing the structures of the metadata summaries,
and in particular, the sub-structures of their metadata portions, summaries can be classified faster.
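
One way to picture a purely structural comparison is to reduce each sub-tree to a signature that records only node types and nesting. The following Python sketch assumes a hypothetical dictionary representation of nodes (`tag`, optional `text`, optional `children`) and a hypothetical function name `structure_signature`; treating structure as "tags and nesting only" is an assumption of this example.

```python
def structure_signature(node):
    """Return a nested tuple of tags, ignoring attribute values and text."""
    children = node.get("children", [])
    return (node["tag"],
            tuple(structure_signature(child) for child in children))


def li(text):
    """Hypothetical helper: one RDF list item holding text content."""
    return {"tag": "rdf:LI", "text": text}


def bag(*items):
    """Hypothetical helper: one RDF bag holding the given list items."""
    return {"tag": "rdf:Bag", "children": list(items)}


page_a = bag(li("Welcome to my homepage."), li("Use Yahoo! to search."))
page_b = bag(li("Hello."), li("Goodbye."))       # same shape, other text
page_c = bag(li("Only one paragraph."))          # different shape

same_shape = structure_signature(page_a) == structure_signature(page_b)
different_shape = structure_signature(page_a) != structure_signature(page_c)
```

Because the signature omits text entirely, two summaries with different shapes can be declared distinct without reading any of their content.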

Fig. 3 illustrates a method for classifying documents, such as posted web pages, in 
accordance with the present invention. Initially at 302, posted documents are retrieved by the 
system. This process is preferably performed using the web crawler module described above. Next 
at 304, the metadata parser module reads the downloaded document and generates a metadata 
summary which summarizes the web page's content and structure as depicted in Fig. 2A. The 
metadata summary is stored in the summary repository until accessed by the search engine module, 
as described above. The web crawler may follow links such as hypertext links associated with web 
pages or other documents as it circulates through the collection of posted documents. 

At 306, the metadata summaries are collected into x different summary groups, each 
summary group containing summaries having a particular attribute-type. In the illustrated 
embodiment, summaries having the same mime-type are grouped. In this embodiment, a first 
summary group labeled F may include n summaries of .gif files (f_1, f_2, f_3, ..., f_n), .txt file summaries
may be placed into a second summary group, HTML file summaries placed into a third summary
group, and Java file summaries placed into a fourth summary group. Of course, metadata summaries 
corresponding to other mime-type files may also be received and grouped as well. 

Those of skill in the art will appreciate that other file attributes may be used as the grouping
criteria either alternatively or in addition to the file mime-type. For instance, the document's
content-length may be used as a grouping criterion either independently or in combination with the
document's mime-type. Other file attributes may be used as grouping criteria as well.
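
The grouping at 306 may be sketched in Python as follows. This is an illustrative sketch only: the function name `group_summaries`, the `keys` parameter, and the dictionary layout of each summary are assumptions introduced for the example.

```python
from collections import defaultdict


def group_summaries(summaries, keys=("mime-type",)):
    """Group summaries by the values of the chosen attribute names."""
    groups = defaultdict(list)
    for summary in summaries:
        group_key = tuple(summary["attributes"].get(key) for key in keys)
        groups[group_key].append(summary)
    return dict(groups)


# Hypothetical summaries; only the attributes used for grouping are shown.
summaries = [
    {"url": "a.gif", "attributes": {"mime-type": "image/gif", "content-length": 512}},
    {"url": "b.txt", "attributes": {"mime-type": "text/plain", "content-length": 40}},
    {"url": "c.gif", "attributes": {"mime-type": "image/gif", "content-length": 900}},
    {"url": "d.gif", "attributes": {"mime-type": "image/gif", "content-length": 512}},
]

by_type = group_summaries(summaries)
by_type_and_length = group_summaries(summaries, keys=("mime-type", "content-length"))
```

Grouping by mime-type alone yields two groups here; adding content-length as a second criterion splits the .gif group further, so fewer pairs need to be compared.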

Next at 310, an equivalence metadata table (EMT) is generated for each summary group to
record the equivalence state between compared metadata summaries. In a preferred embodiment in
which a summary group includes 4 metadata summaries, the EMT is a two-dimensional matrix of 4
rows by 4 columns, each off-diagonal entry indicating the equivalence state between the
corresponding rows and columns. In the preferred embodiment, a 0 is entered if the two intersecting 
metadata summaries are found to be distinct, and a 1 is entered if the two summaries are found to be 
duplicative, as further described below. 
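
A minimal Python sketch of such a table follows. The function names `new_emt` and `record`, the use of `None` for pairs not yet compared, and the mirroring of each entry across the diagonal are assumptions of this example rather than features recited above.

```python
def new_emt(n):
    """Create an n-by-n table; None marks entries not yet filled in."""
    return [[None] * n for _ in range(n)]


def record(emt, i, j, duplicate):
    """Store 1 for a duplicate pair and 0 for a distinct pair."""
    if i == j:
        raise ValueError("diagonal entries compare a summary with itself")
    # Mirroring the entry is an assumption; equivalence is symmetric.
    emt[i][j] = emt[j][i] = 1 if duplicate else 0


emt = new_emt(4)                 # a group of 4 summaries
record(emt, 0, 1, duplicate=False)
record(emt, 0, 2, duplicate=True)
```

After these two calls, row 0 of the table reads `[None, 0, 1, None]`: the first summary is distinct from the second, duplicates the third, and has not yet been compared with the fourth.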

At 315, a summary group is chosen and two summaries contained therein are selected for
comparison. By grouping similar mime-type files, unnecessary file comparisons, e.g., comparisons
between .txt and .gif files, are avoided, thereby accelerating the classification process. An
embodiment of this process is further illustrated in Fig. 4.

At 320, the sub-tree structures of the selected metadata summaries are compared. As
explained above, each metadata summary includes a metadata portion 230 (Fig. 2) having one or
more sub-tree structures, each sub-tree having one or more nodes. In the preferred embodiment, the
structural comparison process includes comparing the sub-tree structures of the metadata portion to
determine equivalence. In an alternative embodiment, one or more sub-tree structures external to the
metadata portion are compared either alternatively, or in addition to, the metadata portion sub-trees.

At 325, a determination is made as to whether the structures of the compared sub-trees are
equivalent. If not, the first and second metadata summaries and their corresponding documents (web
pages) are identified as distinct. This process is performed in one embodiment by entering a 0 into
the aforementioned equivalence metadata table at the appropriate entry location. Both metadata
summaries are subsequently returned to the summary group and the classification process continues
at 320 where a subsequent comparison is initiated. By comparing the metadata summaries initially
on a structural level, the time needed to classify documents as distinct is significantly reduced
compared to the conventional textual comparison process.

If the structures of the first and second summary sub-trees are determined to be equivalent,
the process continues at 335, where the attribute values within the metadata portion sub-tree are
compared. The attribute value comparison process may include locating the attribute title within the
appropriate sub-tree and storing its corresponding attribute value. The stored attribute values are
subsequently compared and their equivalence determined at 340. If the attribute values are not
equivalent, the first and second metadata summaries and their corresponding documents are
identified as distinct. This process is performed in one embodiment by entering a 0 into the
aforementioned equivalence metadata table at the appropriate entry location. Both metadata
summaries are subsequently returned to the summary group.

If the compared attribute values are equivalent, the process continues at 345, where the text
located within the selected sub-tree structures is compared. The text comparison process may
include locating and storing text and comparing the stored text of the two selected summaries. As
can be seen, the attribute and text comparison process in the present invention is performed only
over a portion of the total text of the document, greatly reducing the amount of time needed to
compare the documents. The processes at 320, 335, and 345 are preferably executed using the
above-described summary consolidator module 180 shown in Fig. 1B. In particular, sub-tree comparator
182, attribute comparator 184, and text comparator 186 may be used to perform the processes of 320,
335, and 345, respectively.
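
The three comparison tiers at 320, 335, and 345 may be sketched in Python as a single function that stops at the first tier showing a difference. The node representation (dictionaries with `tag`, `attributes`, `text`, and `children` keys) and every function name below are assumptions made for illustration; the sketch is not the consolidator module itself.

```python
def walk(node):
    """Yield a node and all of its descendants in document order."""
    yield node
    for child in node.get("children", []):
        yield from walk(child)


def shape(node):
    """Tier 1: tags and nesting only."""
    return (node["tag"], tuple(shape(c) for c in node.get("children", [])))


def attribute_values(node):
    """Tier 2: attribute name/value pairs of every node."""
    return [sorted(n.get("attributes", {}).items()) for n in walk(node)]


def text_content(node):
    """Tier 3: text held by every node."""
    return [n.get("text", "") for n in walk(node)]


def classify_pair(a, b):
    """Return (is_duplicate, tier at which the decision was made)."""
    if shape(a) != shape(b):
        return False, "structure"
    if attribute_values(a) != attribute_values(b):
        return False, "attributes"
    if text_content(a) != text_content(b):
        return False, "text"
    return True, "text"


def link(ref, annotation):
    description = {"tag": "rdf:Description",
                   "attributes": {"ref": ref, "annotation": annotation}}
    return {"tag": "rdf:LI", "children": [description]}


def para(text):
    return {"tag": "rdf:LI", "text": text}


def bag(*items):
    return {"tag": "rdf:Bag", "children": list(items)}


original = bag(link("http://www.yahoo.com/", "Yahoo!"), para("Welcome."))
mirror = bag(link("http://www.yahoo.com/", "Yahoo!"), para("Welcome."))
other_text = bag(link("http://www.yahoo.com/", "Yahoo!"), para("Goodbye."))
other_link = bag(link("http://www.people.com/", "People"), para("Welcome."))
shorter = bag(para("Welcome."))
```

Only a pair that survives all three tiers is reported as a duplicate; a pair that differs in shape is rejected without its attributes or text ever being read.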

If the text comparison process indicates that the documents contain identical text, the two 
metadata summaries and their corresponding web pages are identified as duplicates. This process is 
performed in one embodiment by entering a 1 into the aforementioned equivalence metadata table at 
the appropriate entry location. In one embodiment, one of the duplicative metadata summaries is 
removed and the summary group consolidated. A log which indicates some of the removed 
summary's attributes (such as the URL, date, etc.) may be made.

The aforementioned steps are repeated until each of the metadata summaries in all of the 
summary groups are compared and their equivalence states are entered into the corresponding EMT. 
At the conclusion of the process, a set of EMTs store data which indicate the equivalence states of
the retrieved documents. In addition, the summary repository is consolidated into an "ordered 
summary repository" which stores only those metadata summaries corresponding to content-unique 
documents. 
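
The consolidation of a summary group, together with a log of removed summaries, may be sketched in Python as follows. The function name `consolidate`, the `is_duplicate` callback, the summary fields, and the example URLs and dates are all hypothetical; in particular, the callback here compares a single stand-in `content` field rather than performing the structural, attribute, and text comparison.

```python
def consolidate(summaries, is_duplicate):
    """Keep the first summary of each duplicate set and log the rest."""
    kept, removed_log = [], []
    for candidate in summaries:
        match = next((k for k in kept if is_duplicate(k, candidate)), None)
        if match is None:
            kept.append(candidate)
        else:
            # Record some attributes of the removed summary.
            removed_log.append({
                "url": candidate["url"],
                "date": candidate["date"],
                "duplicate_of": match["url"],
            })
    return kept, removed_log


# Hypothetical summaries with made-up URLs and dates.
group = [
    {"url": "http://a.example/", "date": "1999-01-05", "content": "Welcome."},
    {"url": "http://b.example/", "date": "1999-02-11", "content": "Welcome."},
    {"url": "http://c.example/", "date": "1999-03-20", "content": "Goodbye."},
]

kept, removed_log = consolidate(
    group, lambda x, y: x["content"] == y["content"])
```

The `kept` list plays the role of the consolidated group, while `removed_log` preserves enough information to report which documents were dropped and why.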

The process illustrated in Fig. 3 may be repeated to retrieve and compare newly downloaded
documents. For instance, a new document may be downloaded and subsequently placed into a
summary group of the same mime-type. The comparison process is subsequently performed and an
EMT is generated.

The ordered metadata repository and the EMTs may be used in a variety of ways to obtain 
useful information about the summarized documents. In one embodiment, the ordered metadata 
repository may be used as a search engine source from which content-unique documents are
produced in response to entered keywords. In another embodiment, the EMTs may be accessed to 
show all of the documents which contain the same text as a qualifying search result entry. In another 
embodiment, the search engine may query the user for additional selection criteria, such as the 
desired date, or URL in order to choose between two identified duplicative documents. 
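
As a hedged illustration of the second use mentioned above, the following Python sketch reads one row of an EMT to list the documents recorded as duplicates of a given search result. The function name `duplicates_of`, the parallel list of URLs, and the example table are assumptions made for this example.

```python
def duplicates_of(emt, urls, index):
    """Return the URLs whose EMT entry against urls[index] is 1."""
    return [urls[j] for j, state in enumerate(emt[index]) if state == 1]


# Hypothetical group of three summaries; the first and third are duplicates.
urls = ["http://a.example/", "http://b.example/", "http://c.example/"]
emt = [
    [None, 0, 1],
    [0, None, 0],
    [1, 0, None],
]

same_as_first = duplicates_of(emt, urls, 0)
same_as_second = duplicates_of(emt, urls, 1)
```

Here `same_as_first` contains only the third URL, and `same_as_second` is empty because the second document matched nothing.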



AM999074 



Fig. 4 illustrates one embodiment of the processes shown in 315 for selecting metadata 
summaries in accordance with the present invention. Initially at 315a, similar mime-type files are
ordered within the group as described above. For instance, F includes n metadata summaries {f_1, f_2,
f_3, ..., f_n} received in the repository which are summaries of .txt files. Other mime-type groups may
also be included. 

At 315b, the m-th group, for instance the F group, is selected for comparison. Next at
315c, a reference summary f_i is selected and the summary's sub-tree is mapped. During the first
iteration of the process, i=1 and the first metadata summary is selected and sub-graphed as the
reference summary against which the remaining summaries will be compared.

Next at 315d, a secondary summary f_j is selected and its sub-tree is mapped. In the preferred
embodiment, j>i, i.e., during the first iteration, the second metadata summary of the group is selected
and its sub-tree structure mapped. Subsequently, the reference and secondary summaries f_i and f_j are
compared as described above in steps 320-360.

Once the comparison has been performed, a determination at 315e is made as to whether j is
equal to n, i.e., whether the last summary within the selected group has been compared to the
reference summary f_i. If not, j is incremented at 315f and the process returns to 315d where the next
summary within the same group is selected and its sub-tree structure compared to primary summary f_i.
If j=n, indicating that all of the n-i subsequent summaries have been compared to f_i, the process
continues at 315g where a determination is made as to whether i=n-1.

If at 315g a determination is made that i is not equal to n-1, the process continues at 315h
where i is incremented, thereby selecting the next file as the reference file to which all of the
subsequent files will be compared. If at 315g i is determined to be equal to n-1, then all of the
summaries have been compared against each other and a different group may be selected. At 315i, a
determination is made as to whether the group index m is equal to x, indicating the last group. If
not, the group index m is incremented at 315k and the process continues at 315b. If m=x, all of the
groups have been compared and the classification process is complete.
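
The selection order of Fig. 4 amounts to a nested loop over groups, reference summaries, and secondary summaries. The Python sketch below is illustrative only: the function name `compare_all`, the `compare` callback, and the mirrored table entries are assumptions, and Python's zero-based indices stand in for the one-based indices i, j, and m used above.

```python
def compare_all(groups, compare):
    """Compare every reference summary with each later summary in its group."""
    tables = {}
    for name, group in groups.items():       # one pass per group (index m)
        n = len(group)
        emt = [[None] * n for _ in range(n)]
        for i in range(n - 1):                # reference summary f_i
            for j in range(i + 1, n):         # secondary summary f_j, j > i
                state = 1 if compare(group[i], group[j]) else 0
                emt[i][j] = emt[j][i] = state  # mirrored by assumption
        tables[name] = emt
    return tables


# Hypothetical groups; plain strings stand in for metadata summaries, and
# equality stands in for the comparison described at 320 through 345.
groups = {
    "text/plain": ["alpha", "beta", "alpha"],
    "image/gif": ["pixels"],
}
calls = []


def compare(a, b):
    calls.append((a, b))
    return a == b


tables = compare_all(groups, compare)
```

For a group of n summaries the inner loops perform n(n-1)/2 comparisons, so the three-member group above is compared three times and the single-member group not at all.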

The present invention has now been described in terms of the exemplary embodiments. 
Those of skill in the art will appreciate that various modifications and alterations may be made while 
still remaining within the present invention, the scope of which is legally defined as the metes and 
boundaries of the following claims: 

CLAIMS

WE CLAIM : 

1. A method for classifying electronically posted documents, the method comprising:

receiving a first document and a second document;

generating a first metadata summary corresponding to said first document and a
second metadata summary corresponding to the second document, wherein the first metadata
summary includes a first summary sub-tree and the second metadata summary includes a second
summary sub-tree;

comparing the structure of the first summary sub-tree with the structure of the second
summary sub-tree; and

identifying the first and second documents as distinct if the structures of the first and
second summary sub-trees are not equivalent.

2. The method of claim 1, wherein the first summary sub-tree includes at least one
attribute having a first attribute value, and wherein the second summary sub-tree includes at least one
attribute having a second attribute value, the method further comprising:

comparing, for each of the at least one attributes, the first and second attribute values;
and

identifying the first and second documents as distinct if the attribute values of the first
and second summary sub-trees are not equivalent.

3. The method of claim 1, wherein the first summary sub-tree includes text content, and
wherein the second summary sub-tree includes text content, the method further comprising:

comparing the text content included within the first and second summary sub-trees;
and

identifying the first and second documents as distinct if the text content of the first
and second summary sub-trees are not equivalent.

4. The method of claim 2, wherein the first summary sub-tree further includes text
content, and wherein the second summary sub-tree includes text content, the method further 
comprising: 

comparing the text content included within the first and second summary sub-trees; 

and 

identifying the first and second documents as distinct if the text content included 
within the first and second summary sub-trees are not equivalent. 

5. The method of claim 4, further comprising identifying the first and second documents 
as duplicates if the text content within the first and second summary sub-trees are equivalent.

6. The method of claim 5, further comprising removing the second metadata summary 
from the first summary group if the structures of the first and second summary sub-trees are
equivalent and if the first summary value is equivalent to the second summary value for each of the 
at least one attributes. 

7. The method of claim 1, further comprising: 
defining a first equivalence metadata table comprising: 

a first row corresponding to the first metadata summary; 

a second row corresponding to the second metadata summary; 

a first column corresponding to the first metadata summary; and

a second column corresponding to the second metadata summary, wherein the
process of identifying the first and second documents as distinct if the structures of the first and 
second summary sub-trees are not equivalent comprises storing a zero binary value in the first row 
and second column position of the equivalence metadata summary. 

8. The method of claim 2, further comprising: 
defining a first equivalence metadata table comprising: 

a first row corresponding to the first metadata summary; 

a second row corresponding to the second metadata summary; 

a first column corresponding to the first metadata summary; and

a second column corresponding to the second metadata summary, wherein the
process of identifying the first and second documents as distinct if the attribute values of the first and 
second summary sub-trees are not equivalent comprises storing a zero binary value in the first row 
and second column position of the equivalence metadata summary. 

9. The method of claim 3, further comprising: 
defining a first equivalence metadata table comprising: 

a first row corresponding to the first metadata summary;

a second row corresponding to the second metadata summary; 

a first column corresponding to the first metadata summary; and 

a second column corresponding to the second metadata summary, wherein the 
process of identifying the first and second documents as distinct if the text content of the first and
second summary sub-trees are not equivalent comprises storing a zero binary value in the first row
and second column position of the equivalence metadata summary.

10. A method for classifying electronically posted documents, the method comprising: 
receiving a plurality of documents;

generating a respective plurality of metadata summaries corresponding to the plurality 
of received documents; 

grouping a first subset of the respective plurality of metadata summaries into a first
summary group, the first summary group comprising summaries having a first mime-type 
designation; 

selecting a first metadata summary and a second metadata summary from the first
summary group, wherein the first metadata summary includes a first summary sub-tree and the 
second metadata summary includes a second summary sub-tree; 

comparing the structure of the first summary sub-tree with the structure of the second 
summary sub-tree; and 

identifying the first and second documents as distinct if the structures of the first and 
second summary sub-trees are not equivalent. 

11. The method of claim 10, wherein grouping further comprises grouping a second
subset of the respective metadata summaries into a second summary group, the second summary
group comprising summaries having a second mime-type designation.

12. A system for classifying electronically posted documents, the system comprising:

a metadata parser module coupled to receive electronically posted documents, the
metadata parser configured to output respective metadata summaries, wherein each respective
metadata summary comprises one or more sub-tree structures, one or more attributes, and content
text;

a summary repository coupled to receive and store the respective metadata
summaries; and

a summary consolidator coupled to the summary repository, the summary
consolidator configured to delete duplicate metadata summaries from the summary repository.

13. The system of claim 12, wherein the summary consolidator comprises:
a sub-tree comparator configured to compare one or more sub-tree structures of the
retrieved metadata summaries;

an attribute comparator configured to compare the attribute values of the retrieved
metadata summaries; and

a text comparator configured to compare the text content included within the retrieved
metadata summaries.

14. The system of claim 13, wherein the sub-tree comparator is configured to compare the
metadata portion of the metadata summary.

15. The system of claim 13, wherein the attribute comparator is configured to compare
the attribute values included within the metadata portion of the metadata summary.

16. The system of claim 13, wherein the text comparator is configured to compare the
text content included within the metadata portion of the metadata summary.
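
The summary consolidator of claims 12 through 16 can be sketched as three comparators applied in sequence. This is a simplified illustration under stated assumptions, not the claimed system: the nested-dictionary representation of a sub-tree, the helper `node`, and the names `same_structure`, `same_attributes`, `same_text`, `is_duplicate`, and `consolidate` are all hypothetical, and the repository is modeled as a plain list.

```python
def node(tag, text="", attrs=None, children=None):
    """Build a hypothetical summary sub-tree node."""
    return {"tag": tag, "attrs": attrs or {}, "text": text, "children": children or []}


def same_structure(a, b):
    """Sub-tree comparator: same tags and same shape, ignoring values."""
    return (
        a["tag"] == b["tag"]
        and len(a["children"]) == len(b["children"])
        and all(same_structure(x, y) for x, y in zip(a["children"], b["children"]))
    )


def same_attributes(a, b):
    """Attribute comparator: assumes the two structures already match."""
    return a["attrs"] == b["attrs"] and all(
        same_attributes(x, y) for x, y in zip(a["children"], b["children"])
    )


def same_text(a, b):
    """Text comparator: assumes the two structures already match."""
    return a["text"] == b["text"] and all(
        same_text(x, y) for x, y in zip(a["children"], b["children"])
    )


def is_duplicate(a, b):
    # Structure is checked first so the later comparators can rely on it.
    return same_structure(a, b) and same_attributes(a, b) and same_text(a, b)


def consolidate(repository):
    """Return the repository with later duplicates of earlier summaries removed."""
    kept = []
    for summary in repository:
        if not any(is_duplicate(summary, other) for other in kept):
            kept.append(summary)
    return kept


page = node("presentation-text", children=[node("LI", "Welcome to my homepage.")])
mirror = node("presentation-text", children=[node("LI", "Welcome to my homepage.")])
other = node("presentation-text", children=[node("LI", "Hello."), node("LI", "Bye.")])

kept = consolidate([page, mirror, other])
```

In this sketch `mirror` is dropped as a duplicate of `page`, while `other` is retained because its sub-tree structure differs and the attribute and text comparators are never reached for that pair.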

17. A program product for use in a computer system that executes program steps recorded
in a computer-readable media to perform a method for classifying electronically posted documents, 
the program product comprising: 

a recordable media; 

a program of computer-readable instructions executable by the computer system to 
perform processes comprising: 

receiving a first document and a second document; 

generating a first metadata summary corresponding to said first document and 
a second metadata summary corresponding to the second document, wherein the first metadata 
summary includes a first summary sub-tree and the second metadata summary includes a second 
summary sub-tree; 

comparing the structure of the first summary sub-tree with the structure of the 
second summary sub-tree; and 

identifying the first and second documents as distinct if the structures of the 
first and second summary sub-trees are not equivalent. 

18. The program product of claim 17, wherein the first summary sub-tree includes at least 
one attribute having a first attribute value, and wherein the second summary sub-tree includes at least
one attribute having a second attribute value, the program product method further comprising the
processes of: 

comparing, for each of the at least one attributes, the first and second attribute values; 

and 

identifying the first and second documents as distinct if the attribute values of the first 
and second summary sub-trees are not equivalent. 

19. The program product of claim 18, wherein the first summary sub-tree includes text
content, and wherein the second summary sub-tree includes text content, the program product further 
comprising the processes of: 

comparing the text content included within the first and second summary sub-trees; 

and 

identifying the first and second documents as distinct if the text content of the first
and second summary sub-trees are not equivalent.

20. The program product of claim 19, further comprising the method step of identifying 
the first and second documents as duplicates if the text content within the first and second summary 
sub-trees are equivalent. 

21. The program product of claim 20, further comprising the process of removing the
second metadata summary from the first summary group. 

22. The program product of claim 21, further comprising the processes of:
defining a first equivalence metadata table comprising: 

a first row corresponding to the first metadata summary;

a second row corresponding to the second metadata summary;

a first column corresponding to the first metadata summary; and 

a second column corresponding to the second metadata summary, wherein the 
process of identifying the first and second documents as distinct if the text content of the first and 
second summary sub-trees are not equivalent comprises storing a zero binary value in the first row 
and second column position of the equivalence metadata summary. 

23. The method of claim 18, further comprising the processes of: 
defining a first equivalence metadata table comprising: 

a first row corresponding to the first metadata summary; 

a second row corresponding to the second metadata summary; 

a first column corresponding to the first metadata summary; and 

a second column corresponding to the second metadata summary, wherein the 
process of identifying the first and second documents as distinct if the attribute values of the first and 
second summary sub-trees are not equivalent comprises storing a zero binary value in the first row 
and second column position of the equivalence metadata summary. 

24. The method of claim 19, further comprising the processes of: 
defining a first equivalence metadata table comprising: 

a first row corresponding to the first metadata summary;

a second row corresponding to the second metadata summary; 

a first column corresponding to the first metadata summary; and 

a second column corresponding to the second metadata summary, wherein the 
process of identifying the first and second documents as distinct if the text content of the first and 
second summary sub-trees are not equivalent comprises storing a zero binary value in the first row 
and second column position of the equivalence metadata summary. 

25. A program product for use in a computer system that executes program steps recorded
in a computer-readable media to perform a method for classifying electronically posted documents, 
the program product comprising: 

a recordable media; 

a program of computer-readable instructions executable by the computer system to 
perform method steps comprising: 

receiving a plurality of documents; 

generating a respective plurality of metadata summaries corresponding to the 
plurality of received documents; 

grouping a first subset of the respective plurality of metadata summaries into a
first summary group, the first summary group comprising summaries having a first mime-type
designation; 

selecting a first metadata summary and a second metadata summary from the
first summary group, wherein the first metadata summary includes a first summary sub-tree and the 
second metadata summary includes a second summary sub-tree; 

comparing the structure of the first summary sub-tree with the structure of the 
second summary sub-tree; and 

identifying the first and second documents as distinct if the structures of the 
first and second summary sub-trees are not equivalent. 

26. The program product of claim 25, wherein the step of grouping further comprises the 
step of grouping a second subset of the respective metadata summaries into a second summary 
group, the second summary group comprising summaries having a second mime-type designation. 

ABSTRACT OF THE DISCLOSURE 

A method for classifying electronically posted documents includes receiving two posted
documents and generating corresponding metadata summaries for each, wherein each of the
metadata summaries includes at least one sub-tree structure. The structures of the two summary
sub-trees within the respective metadata summaries are subsequently compared. If the two summary
sub-trees are different, the two documents are deemed distinct. If the two summary sub-trees are the
same, attribute values and text content of the metadata summaries are compared over a portion of the
metadata summaries. If the compared attribute values and text content are determined to be the
same, the documents are deemed duplicative.

201949 vOl.PA (4BTP0H.DOC) 
1/26/00 2:23 PM (11 886.7008) 
<?xml version="1.0"?>

<rdf:RDF xmlns:rdf="http://www.w3.org/schemas/rdf-schema">
<!--
    This RDF Description contains information about (1) the data gatherer and
    (2) the metadata of the data source.
-->
<rdf:Description gatherer="Grand Central Station Gatherer 11"
    gathered-on="Tue Mar 23 17:38:40 GMT 1999"
    summarizer="com.ibm.almaden.gcs.summarizer.HTMLSummaryMaker"
    resource="http://www.people.com/jane_doe/homepage.html"
    source-last-modified="Tue Feb 17 15:43:58 GMT 1999"
    mime-type="http/html"
    content-length="46220"
    source-is="Good"
    comments="good"/>
<!--
    This RDF Description contains information from the data source itself.
    In addition to textual information, it contains structural information, for example
    the URLs that the page points to.
-->
<rdf:Description html-title="Jane's Homepage"
    html-encoding="8859_1"
    abstract="Personal homepage of Jane Doe">
<!--
    The "ref-annotations" contains summaries of the out-links of the HTML
    page. In this case, the attribute "ref" gives the URL of the referenced
    page. The attribute "annotation" gives the text associated with each
    out-link.
-->
<ref-annotations>
<rdf:Bag>
<rdf:LI>
<rdf:Description
    ref="http://www.yahoo.com/"
    annotation="Yahoo!"/>
</rdf:LI>
<rdf:LI>
<rdf:Description
    ref="http://www.people.com/jane_doe/my_photo.jpg"
    annotation="picture of me."/>
</rdf:LI>
</rdf:Bag>
</ref-annotations>
<!--
    The "presentation-text" contains the textual content of the HTML page
    that may be seen through a WWW browser.
-->
<presentation-text>
<rdf:Bag>
<rdf:LI>Welcome to my homepage. </rdf:LI>
<rdf:LI>Use Yahoo! to search for something or look
at a picture of me.</rdf:LI>
</rdf:Bag>
</presentation-text>
</rdf:Description>
</rdf:RDF>
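
A metadata summary of the kind shown in the figure can be read with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on an abbreviated copy of the figure; the variable names and the choice of parser are assumptions for illustration, and the namespace URI is the one reconstructed from the figure. It is not the parser module described in the claims.

```python
import xml.etree.ElementTree as ET

# Namespace URI as it appears (after cleanup) in the figure.
RDF_NS = "http://www.w3.org/schemas/rdf-schema"

# Abbreviated copy of the figure, kept inline so the sketch is self-contained.
SUMMARY = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/schemas/rdf-schema">
  <rdf:Description resource="http://www.people.com/jane_doe/homepage.html"
                   mime-type="http/html"/>
  <rdf:Description html-title="Jane's Homepage">
    <ref-annotations>
      <rdf:Bag>
        <rdf:LI><rdf:Description ref="http://www.yahoo.com/" annotation="Yahoo!"/></rdf:LI>
      </rdf:Bag>
    </ref-annotations>
    <presentation-text>
      <rdf:Bag>
        <rdf:LI>Welcome to my homepage.</rdf:LI>
      </rdf:Bag>
    </presentation-text>
  </rdf:Description>
</rdf:RDF>"""

root = ET.fromstring(SUMMARY)

# The two top-level Descriptions: gatherer/source metadata, then page content.
gatherer_info, page_info = root.findall(f"{{{RDF_NS}}}Description")

mime_type = gatherer_info.get("mime-type")
title = page_info.get("html-title")

# Out-links are the nested Descriptions that carry a "ref" attribute.
out_links = [
    d.get("ref")
    for d in page_info.iter(f"{{{RDF_NS}}}Description")
    if d.get("ref")
]

# Presentation text is the list of LI items under <presentation-text>.
text_items = [
    li.text for li in page_info.find("presentation-text").iter(f"{{{RDF_NS}}}LI")
]
```

The `mime-type` attribute extracted here is the value a grouping step would key on, while the nested elements under the second Description are the sub-tree whose structure, attribute values, and text content would be compared.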
the first paragraph of Title 35, United States Code, § 112, I acknowledge the duty to disclose information which is material to patentability